Introduction


In this practical, we show how to load pre-trained word vectors and fine-tune them for sentiment classification on movie reviews. We use the following packages:

library(text2vec)
library(tidyverse)
library(tidytext)

Besides these packages, we need to install the TensorFlow and Keras packages for R.

The TensorFlow package provides code completion and inline help for the TensorFlow API when running within the RStudio IDE. The TensorFlow API is composed of a set of Python modules that enable constructing and executing TensorFlow graphs.

Install the TensorFlow R package from GitHub as follows:

# devtools::install_github("rstudio/tensorflow")

Then, use the install_tensorflow() function to install TensorFlow:

library(tensorflow)
# install_tensorflow(package_url = "https://pypi.python.org/packages/b8/d6/af3d52dd52150ec4a6ceb7788bfeb2f62ecb6aa2d1172211c4db39b349a2/tensorflow-1.3.0rc0-cp27-cp27mu-manylinux1_x86_64.whl#md5=1cf77a2360ae2e38dd3578618eacc03b")

The URL above pins one specific build (TensorFlow 1.3.0rc0 for Python 2.7 on Linux, as its file name shows) rather than the latest version. To install the default version instead, run install_tensorflow() without any arguments.

Finally, you can confirm that the installation succeeded with:

tmr <- tf$constant("Text Mining with R!")
print(tmr)
## tf.Tensor(b'Text Mining with R!', shape=(), dtype=string)

This will provide you with a default installation of TensorFlow suitable for getting started with the TensorFlow R package. See the article on installation (https://tensorflow.rstudio.com/installation/) to learn about more advanced options, including installing a version of TensorFlow that takes advantage of Nvidia GPUs if you have the correct CUDA libraries installed.

To install the Keras package you first run either of the following lines:

# install.packages("keras")
# devtools::install_github("rstudio/keras")

Then, use the install_keras() function to install Keras. The Keras R interface uses the TensorFlow backend engine by default. This will provide you with default CPU-based installations of Keras and TensorFlow. If you want a more customized installation, e.g. if you want to take advantage of NVIDIA GPUs, see the documentation for install_keras() and the article on installation (https://tensorflow.rstudio.com/installation/).

The ISLR authors also prepared an installation guide to Python, Reticulate and Keras: https://web.stanford.edu/~hastie/ISLR2/keras-instructions.html
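
Before moving on, it can help to check which of the packages named above are already installed. The snippet below is a minimal sketch using only base R; the names installed and missing_pkgs are ours, and the package list is taken from the library() calls in this practical.

```r
# Packages this practical relies on (taken from the library() calls in the text)
required <- c("text2vec", "tidyverse", "tidytext", "tensorflow", "keras")

# requireNamespace() returns TRUE/FALSE without attaching the package
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)

# Names of packages that still need to be installed
missing_pkgs <- names(installed)[!installed]
print(missing_pkgs)
```

An empty result means all five R packages are available. It does not confirm that the Python side of TensorFlow is installed; the tf$constant() check above covers that.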


Sentiment classification with pre-trained word vectors


Now we have TensorFlow and Keras ready for fine-tuning pre-trained word embeddings for sentiment classification on movie reviews.

Remember to load the Keras library:

library(keras)
## 
## Attaching package: 'keras'
## The following objects are masked from 'package:text2vec':
## 
##     fit, normalize

For sentiment classification, we use GloVe pre-trained word vectors. These vectors were trained on Wikipedia 2014 and Gigaword 5 (6B tokens, 400K-word uncased vocabulary) and are available in 50, 100, 200, and 300 dimensions. Download the glove.6B.300d.txt file manually from the website or use the code below.

# Download GloVe vectors if necessary and unzip them into the data/ folder used below
# if (!file.exists('glove.6B.zip')) {
#   download.file('https://nlp.stanford.edu/data/glove.6B.zip', destfile = 'glove.6B.zip')
#   unzip('glove.6B.zip', exdir = 'data')
# }

  1. Use the code below to load the pre-trained word vectors from the file ‘glove.6B.300d.txt’ (if you have memory issues, load the file ‘glove.6B.50d.txt’ instead).

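# load GloVe vectors; quote = "" keeps tokens such as " from being read as quote characters
vectors <- data.table::fread('data/glove.6B.300d.txt', data.table = FALSE, encoding = 'UTF-8', quote = "")
# name the columns from the actual number of dimensions, so the 50d file also works
colnames(vectors) <- c('word', paste('dim', seq_len(ncol(vectors) - 1), sep = '_'))

# convert the data frame to a tibble
vectors <- as_tibble(vectors)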
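
To make the file format concrete, here is a small sketch that parses three made-up lines in the GloVe text layout (a token followed by its numbers) and compares word vectors with cosine similarity. The values are invented and the helper cosine() is ours; real GloVe vectors have 50 to 300 dimensions.

```r
# Three made-up lines in the GloVe text format: a token followed by its vector
toy_lines <- c(
  "good 0.9 0.1 0.3",
  "great 0.8 0.2 0.4",
  "bad -0.7 0.1 0.2"
)

# quote = "" and comment.char = "" keep tokens such as " and # from being misparsed
toy <- read.table(text = toy_lines, sep = " ", quote = "", comment.char = "",
                  stringsAsFactors = FALSE)
colnames(toy) <- c("word", paste("dim", seq_len(ncol(toy) - 1), sep = "_"))

# Numeric matrix with one row per word
emb <- as.matrix(toy[, -1])
rownames(emb) <- toy$word

# Cosine similarity between two word vectors
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

sim_good_great <- cosine(emb["good", ], emb["great", ])  # close to 1
sim_good_bad   <- cosine(emb["good", ], emb["bad", ])    # negative
print(c(sim_good_great, sim_good_bad))
```

In this toy example, "good" and "great" point in nearly the same direction, while "good" and "bad" point in opposing directions.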

  1. The IMDB movie review data set is a labeled data set that ships with the text2vec package. It consists of 5000 IMDB movie reviews selected for sentiment analysis. The sentiment is binary: an IMDB rating < 5 gives a sentiment score of 0, and a rating >= 7 gives a sentiment score of 1. No individual movie has more than 30 reviews. Load this data set and convert it to a dataframe.

# load an example dataset from text2vec and keep the converted tibble
data("movie_review")
movie_review <- as_tibble(movie_review)
movie_review
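
The labeling rule described above can be written as a small function. This is only an illustration: the data set already contains the sentiment column, and rating_to_sentiment() is a hypothetical helper, not part of text2vec. Ratings that fall under neither condition get no label in this sketch.

```r
# Hypothetical helper mirroring the labeling rule described above:
# rating < 5 -> 0 (negative), rating >= 7 -> 1 (positive), anything else -> NA
rating_to_sentiment <- function(rating) {
  ifelse(rating < 5, 0L, ifelse(rating >= 7, 1L, NA_integer_))
}

ratings <- c(1, 4, 5, 6, 7, 10)   # invented ratings
labels <- rating_to_sentiment(ratings)
print(labels)
```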

  1. To create a learning model using Keras, let’s first define the hyperparameters. Define the parameters of your Keras model with a maximum of 10000 words, maxlen of 60 and word embedding size of 300 (if you have memory problems, change the embedding dimension to a smaller value, e.g., 50).

max_words <- 1e4
maxlen    <- 60
dim_size  <- 300
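
To see why a smaller embedding dimension helps with memory, the sketch below gives rough size estimates. It assumes 8-byte doubles for the GloVe table in R and 4-byte floats for the Keras embedding weights, and it ignores the word column and other overhead. The names n_glove, n_words and size_mb() are ours.

```r
# Rough size estimates, assuming 8-byte doubles in R and 4-byte floats in Keras
n_glove <- 4e5   # 400K GloVe tokens, as stated above
n_words <- 1e4   # vocabulary kept by the model (max_words)

size_mb <- function(rows, cols, bytes) rows * cols * bytes / 1e6

for (d in c(300, 50)) {
  cat(sprintf("dim = %3d | GloVe table in R: %4.0f MB | embedding layer: %d weights (%.0f MB)\n",
              d,
              size_mb(n_glove, d, 8),
              as.integer(n_words * d),
              size_mb(n_words, d, 4)))
}
```

Under these assumptions, the full GloVe table dominates memory use, not the embedding layer of the model.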

  1. Use the text_tokenizer function from Keras and tokenize the IMDB review data using a maximum of 10000 words.

# create a tokenizer and fit it on the reviews (this builds the word index)
word_seqs <- text_tokenizer(num_words = max_words) %>%
  fit_text_tokenizer(movie_review$review)
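
The following base-R sketch imitates what the tokenizer does conceptually: lowercase the text, strip punctuation, rank words by frequency, and keep only words whose index is below num_words. It is a simplified stand-in for illustration on an invented corpus, not the Keras implementation, and all names in it are ours.

```r
# Toy corpus (invented sentences, not taken from the IMDB data)
texts <- c("A great movie, great cast!", "A bad movie.", "Great fun")

# Lowercase, strip punctuation, split on whitespace
tokenize <- function(x) strsplit(gsub("[[:punct:]]", "", tolower(x)), "\\s+")
tokens <- tokenize(texts)

# Rank words by frequency: the most frequent word gets index 1
freq <- sort(table(unlist(tokens)), decreasing = TRUE)
word_index <- setNames(seq_along(freq), names(freq))

# Map each text to indices, keeping only words with index < num_words
num_words <- 4
to_sequence <- function(tok) {
  idx <- unname(word_index[tok])
  idx[idx < num_words]
}
seqs <- lapply(tokens, to_sequence)
print(word_index)
print(seqs)
```

Rare words ("cast", "bad", "fun") fall outside the first three indices and are dropped from the sequences, which is what happens to infrequent words in the reviews when num_words is 10000.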

  1. Transform each text into a sequence of integers (word indices) and use the pad_sequences function to pad the sequences.

# apply the tokenizer to the text to get word indices instead of words,
# then pad (or truncate) every sequence to maxlen
x_train <- texts_to_sequences(word_seqs, movie_review$review) %>%
  pad_sequences(maxlen = maxlen)
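
With its default settings, pad_sequences pads and truncates at the start of each sequence. The sketch below reproduces that behavior in base R for two invented sequences; pad_pre() is a hypothetical helper written for this illustration.

```r
# Pad or truncate one integer sequence to a fixed length, on the left in both cases
pad_pre <- function(s, maxlen) {
  if (length(s) >= maxlen) {
    tail(s, maxlen)                      # keep the last maxlen tokens
  } else {
    c(rep(0L, maxlen - length(s)), s)    # left-pad with the reserved index 0
  }
}

seqs <- list(c(5L, 2L), c(7L, 1L, 4L, 9L, 3L))   # invented word indices
padded <- t(vapply(seqs, pad_pre, integer(4), maxlen = 4))
print(padded)
```

The short sequence becomes 0 0 5 2, and the long one keeps its last four indices, 1 4 9 3. With maxlen = 60, only the final 60 kept tokens of a long review reach the model.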

  1. Convert the tokenizer's word index (the mapping from words to integer indices) into a dataframe.

# unlist word indices
word_indices <- unlist(word_seqs$word_index)

# then place them into data.frame 
dic <- data.frame(word = names(word_indices), key = word_indices, stringsAsFactors = FALSE) %>%
  arrange(key) %>% .[1:max_words,]

  1. Use the code below to join the dataframe of sequences (word indices) from the IMDB reviews with GloVe pre-trained word vectors.

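# join the words with GloVe vectors; words missing from GloVe get all-zero vectors
word_embeds <- dic %>% left_join(vectors, by = "word") %>% select(-word, -key) %>%
  replace(., is.na(.), 0) %>% as.matrix()
# row 1 of the weight matrix belongs to index 0 (padding), so shift the word rows down by one
word_embeds <- rbind(0, word_embeds[-max_words, , drop = FALSE])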
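
Index alignment matters here: Keras reserves index 0 for padding, and an integer index i is looked up in row i + 1 of the R weight matrix (row i in Python's 0-based terms). The toy sketch below builds a small embedding matrix with that layout. All words, values and names are invented for illustration.

```r
# Toy vocabulary with Keras-style indices (1 = most frequent word)
dic_toy <- data.frame(word = c("good", "movie", "zzyzx"), key = 1:3,
                      stringsAsFactors = FALSE)

# Toy pre-trained vectors (invented values); "zzyzx" is deliberately missing
vec_toy <- data.frame(word = c("movie", "good"),
                      dim_1 = c(0.1, 0.9), dim_2 = c(0.5, 0.3),
                      stringsAsFactors = FALSE)

# Look up each vocabulary word; unmatched words become rows of NA, then 0
rows <- match(dic_toy$word, vec_toy$word)
emb <- as.matrix(vec_toy[rows, c("dim_1", "dim_2")])
emb[is.na(emb)] <- 0

# Row 1 is reserved for index 0 (padding), so the word with key k sits in row k + 1
emb <- rbind(0, emb)
rownames(emb) <- c("<pad>", dic_toy$word)
print(emb)
```

A word that is absent from the pre-trained vectors keeps an all-zero row, so it contributes nothing to the model while the embedding layer is frozen.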

  1. Extract the outcome variable from the sentiment column in the original dataframe and name it y_train.

# the outcome variable
y_train <- as.matrix(movie_review$sentiment)

  1. Use the Keras functional API and create a neural network model as below. Can you describe this model?

# Use Keras Functional API 
input <- layer_input(shape = list(maxlen), name = "input")

model <- input %>%
  layer_embedding(input_dim = max_words, output_dim = dim_size, input_length = maxlen,
                  # put weights into list and do not allow training
                  weights = list(word_embeds), trainable = FALSE) %>%
  layer_spatial_dropout_1d(rate = 0.2) %>%
  bidirectional(
    layer_gru(units = 80, return_sequences = TRUE)
  )
max_pool <- model %>% layer_global_max_pooling_1d()
ave_pool <- model %>% layer_global_average_pooling_1d()

output <- layer_concatenate(list(ave_pool, max_pool)) %>%
  layer_dense(units = 1, activation = "sigmoid")

model <- keras_model(input, output)

# model summary
summary(model)
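
As a hint for describing the model: the bidirectional GRU returns one vector per time step (80 forward plus 80 backward units, concatenated by default), and the two pooling layers each summarize that sequence over time. The base-R sketch below mimics the pooling, concatenation and sigmoid output for one review, using random numbers in place of real GRU activations and invented weights.

```r
set.seed(1)
timesteps <- 60           # maxlen
features  <- 2 * 80       # forward + backward GRU units, concatenated

# Stand-in for the bidirectional GRU output of one review (random values)
h <- matrix(rnorm(timesteps * features), nrow = timesteps, ncol = features)

ave_pool <- colMeans(h)          # global average pooling over time
max_pool <- apply(h, 2, max)     # global max pooling over time

# Concatenate both summaries, then apply a dense sigmoid unit
pooled <- c(ave_pool, max_pool)
w <- rnorm(length(pooled), sd = 0.05)   # invented weights, not trained values
b <- 0
prob <- 1 / (1 + exp(-(sum(w * pooled) + b)))

print(length(pooled))   # number of inputs to the dense layer
print(prob)             # a value between 0 and 1
```

The dense layer therefore receives 320 inputs per review and maps them to a single probability of positive sentiment.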

  1. Compile the model with an ‘adam’ optimizer, and the binary_crossentropy loss. You can choose accuracy or AUC for the metrics.

# instead of accuracy we can use "AUC" metrics from "tensorflow.keras"
model %>% compile(
  optimizer = "adam", # optimizer = optimizer_rmsprop(),
  loss = "binary_crossentropy",
  metrics = tensorflow::tf$keras$metrics$AUC() # metrics = c('accuracy')
)
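
To make the loss and the metric concrete, the sketch below computes binary cross-entropy and AUC by hand for four invented predictions. The AUC here uses the exact rank-sum formulation, whereas the Keras AUC metric approximates the curve with a fixed set of thresholds, so values can differ slightly. Both function names are ours.

```r
# Binary cross-entropy averaged over samples; eps guards against log(0)
binary_crossentropy <- function(y, p, eps = 1e-7) {
  p <- pmin(pmax(p, eps), 1 - eps)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# AUC via the rank-sum (Mann-Whitney) formulation
auc <- function(y, p) {
  r <- rank(p)
  n_pos <- sum(y == 1)
  n_neg <- sum(y == 0)
  (sum(r[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

y <- c(1, 0, 1, 0)            # invented labels
p <- c(0.9, 0.2, 0.4, 0.6)    # invented predicted probabilities

loss_value <- binary_crossentropy(y, p)
auc_value  <- auc(y, p)
print(c(loss = loss_value, auc = auc_value))
```

Three of the four positive-negative pairs are ranked correctly in this toy example, which gives an AUC of 0.75.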

  1. Fit the model with 10 epochs (full passes over the training data), batch_size = 32, and validation_split = 0.2. Compare the training performance with the validation performance.

history <- model %>% keras::fit(
  x_train, y_train,
  epochs = 10,
  batch_size = 32,
  validation_split = 0.2
)

plot(history)
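
When interpreting the plot, note that validation_split holds out the last fraction of the rows, taken before any shuffling. The arithmetic below is a sketch of what that means for this data set, assuming the 5000 reviews stated earlier. The variable names are ours.

```r
n          <- 5000   # reviews in movie_review, as stated above
batch_size <- 32
val_frac   <- 0.2

# The last 20% of the rows are held out for validation
n_train    <- as.integer(round(n * (1 - val_frac)))
train_rows <- seq_len(n_train)
val_rows   <- seq(n_train + 1, n)

# Number of weight updates (batches) per epoch
steps_per_epoch <- ceiling(n_train / batch_size)

print(c(train = n_train, validation = length(val_rows), steps = steps_per_epoch))
```

If the training metric keeps improving while the validation metric stalls or worsens, the model is likely overfitting the 4000 training reviews.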